[Chapter 6] Exploring the nutrition data

[DSLC stages]: EDA

We examined and cleaned the nutrition data in the file 01_cleaning.qmd. In this document, you will find the code for conducting an EDA of the nutrition data. This document is far from exhaustive and you are encouraged to add some sections to conduct your own explorations of this data.

In each code file that uses the cleaned version of the data, it is good practice to load in the original “raw” (uncleaned) data and then clean it using the cleaning function you wrote. It is often helpful to keep a copy of the original uncleaned data in your environment too.

First we will load in the libraries that we will use. Note that if you don’t have superheat, we recommend downloading it from github using the devtools code that is commented out, rather than from CRAN (the CRAN version is outdated).

library(tidyverse)
library(patchwork)
# to install the superheat library run the following code:
# library(devtools)
# devtools::install_github("rlbarter/superheat")
library(superheat)

Next, we will load the cleaning and pre-processing functions that we wrote.

# source the cleaning function code
source("functions/cleanFoodData.R")
source("functions/preProcessFoodData.R")

And load in the raw data objects and clean them. We will create a cleaned version of both the FNDDS and Legacy datasets. We will also pre-process each dataset, by log-transforming and SD-scaling them (we won’t center them).

# load in all of the raw data objects
nutrient_amount <- read_csv("../data/food_nutrient.csv")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 5370480 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): id, fdc_id, nutrient_id, amount, data_points, derivation_id, min, m...
lgl (2): median, footnote

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
food_name <- read_csv("../data/food.csv")
Rows: 335142 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): data_type, description
dbl  (2): fdc_id, food_category_id
date (1): publication_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nutrient_name <- read_csv("../data/nutrient_name.csv")
Rows: 187 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): nutrient_name
dbl (1): nutrient_id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Clean the FNDDS data
food_fndds <- cleanFoodData(.nutrient_amount_data = nutrient_amount, 
                            .food_name_data = food_name,
                            .nutrient_name_data = nutrient_name, 
                            # fndds is the default value
                            .select_data_type = "survey_fndds_food")
# Preprocess the FNDDS data
food_fndds_log_scaled <- preProcessFoodData(food_fndds, 
                                                     .log_transform = TRUE,
                                                     .center = FALSE,
                                                     .scale = TRUE,
                                                     .remove_fat = FALSE)
# Clean the Legacy data
food_legacy <- cleanFoodData(.nutrient_amount_data = nutrient_amount, 
                             .food_name_data = food_name,
                             .nutrient_name_data = nutrient_name, 
                             .select_data_type = "sr_legacy_food")
# Preprocess the Legacy data
food_legacy_log_scaled <- preProcessFoodData(food_legacy,
                                                     .log_transform = TRUE,
                                                     .center = FALSE,
                                                     .scale = TRUE,
                                                     .remove_fat = FALSE)

1 High-level summary of the data

Here are some histograms of the nutrients.

Ironically, the easiest way to plot these using ggplot is to first revert the data back to a long format.

food_fndds |> 
  # convert the data to a long format (ignoring the first description column)
  pivot_longer(2:ncol(food_fndds), 
               names_to = "nutrient", 
               values_to = "amount") |>
  # plot histograms
  ggplot() +
  geom_histogram(aes(x = amount)) +
  facet_wrap(~nutrient, scales = "free", ncol = 3)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distributions of each of these nutrients is fairly skewed.

After log-transforming and standardizing these nutrients however, some of them look substantially more symmetrical. Some of the nutrients that were particularly skewed are still fairly skewed.

food_fndds_log_scaled |> 
  pivot_longer(2:ncol(food_fndds), 
               names_to = "nutrient", 
               values_to = "amount") |> 
  ggplot() +
  geom_histogram(aes(x = amount)) +
  facet_wrap(~nutrient, scales = "free", ncol = 3)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The following table shows a random sample of 15 food items.

set.seed(3243581)
food_fndds |> 
  sample_n(15)  |>
  arrange(description)
description protein fat carbohydrates calories alcohol moisture caffeine theobromine total_dietary_fiber calcium iron magnesium phosphorus potassium sodium zinc copper selenium retinol beta_carotene alpha_carotene alpha_tocopherol cryptoxanthin lycopene lutein_zeaxanthin vitamin_c thiamine riboflavin niacin vitamin_b6 folate vitamin_b12 total_choline phylloquinone cholesterol saturated_fat butyric_acid caproic_acid caprylic_acid capric_acid lauric_acid myristic_acid palmitic_acid stearic_acid oleic_acid linoleic_acid alpha_linolenic_acid arachidonic_acid docosahexaenoic_acid palmitoleic_acid parinaric_acid eicosenoic_gadoleic_acid ecosapentenoic_acid erucic_acid docosapentaenoic_acid monounsaturated_fat polyunsaturated_fat
Bread, marble rye and pumpernickel, toasted 9.34 3.63 53.08 285 0 31.10 0 0 6.4 80 3.11 44 137 182 663 1.25 0.204 34.0 0 4 0 0.36 1 0 56 0.4 0.382 0.368 4.181 0.082 103 0.00 16.0 1.3 0 0.688 0.000 0.000 0.000 0.000 0.000 0.012 0.423 0.253 1.424 0.812 0.066 0.000 0.000 0.013 0.000 0.003 0.000 0.000 0.000 1.441 0.878
Cheese sandwich, Cheddar cheese, on white bread, with mayonnaise 12.97 22.82 26.75 366 0 34.85 0 0 1.4 333 1.94 22 218 95 567 1.72 0.065 22.0 121 32 0 0.77 0 0 24 0.0 0.287 0.283 2.494 0.070 68 0.41 17.7 20.6 41 8.605 0.231 0.194 0.125 0.299 0.345 1.099 4.294 1.716 4.961 5.886 0.786 0.026 0.001 0.202 0.000 0.057 0.004 0.003 0.006 5.690 6.737
Hamburger, 1 medium patty, with condiments, on bun, from fast food / restaurant 15.68 11.60 22.14 256 0 48.81 0 0 1.1 87 1.73 24 121 225 374 3.04 0.101 24.8 0 26 1 0.06 2 668 21 0.3 0.160 0.201 4.650 0.148 22 1.66 33.3 5.0 40 4.785 0.000 0.000 0.000 0.000 0.000 0.352 2.740 1.513 4.833 0.999 0.105 0.019 0.000 0.438 0.057 0.038 0.000 0.000 0.000 5.404 1.179
Herring, smoked, kippered 24.58 12.37 0.00 217 0 59.70 0 0 0.0 84 1.51 46 325 447 918 1.36 0.135 52.6 40 0 0 1.54 0 0 0 1.0 0.126 0.319 4.402 0.413 14 18.70 95.0 0.1 82 2.791 0.000 0.000 0.000 0.000 0.016 0.758 1.851 0.149 2.074 0.178 0.141 0.082 1.179 0.851 0.293 0.986 0.970 1.149 0.075 5.112 2.919
Ice cream sandwich, made with light chocolate ice cream 5.58 6.60 38.54 236 0 48.59 3 92 1.5 125 0.93 26 91 185 55 0.54 0.158 6.8 46 103 0 0.21 0 0 4 0.9 0.036 0.111 0.306 0.027 7 0.11 16.6 0.7 21 3.671 0.172 0.059 0.060 0.107 0.069 0.538 1.701 0.790 1.785 0.404 0.070 0.000 0.000 0.152 0.000 0.001 0.000 0.000 0.000 1.938 0.477
Oxtail soup 8.40 4.87 1.37 85 0 84.68 0 0 0.3 11 1.12 9 68 105 139 2.75 0.054 7.3 11 237 85 0.14 0 106 59 1.9 0.029 0.082 0.804 0.093 6 0.65 32.8 10.2 31 2.212 0.053 0.033 0.020 0.044 0.045 0.225 1.156 0.621 1.719 0.152 0.014 0.011 0.000 0.137 0.000 0.004 0.001 0.000 0.004 1.861 0.179
Pancakes, plain, reduced fat, from fozen 7.56 9.54 36.09 262 0 43.98 0 0 2.2 208 1.51 20 329 171 667 0.75 0.064 11.3 71 12 0 1.12 1 0 101 0.0 0.221 0.292 1.423 0.151 46 0.35 45.4 6.8 61 2.811 0.087 0.058 0.040 0.071 0.075 0.232 1.566 0.615 3.232 2.424 0.291 0.027 0.008 0.070 0.000 0.026 0.000 0.000 0.001 3.337 2.761
Pizza, cheese, with vegetables, from restaurant or fast food, medium crust 10.31 8.69 30.71 242 0 48.08 0 0 2.2 170 2.26 23 196 171 535 1.22 0.101 17.9 55 96 1 0.77 0 1714 74 4.8 0.354 0.177 3.455 0.090 85 0.38 15.4 6.5 15 3.999 0.090 0.072 0.049 0.126 0.157 0.539 2.059 0.768 2.139 1.318 0.159 0.011 0.000 0.105 0.002 0.033 0.004 0.002 0.004 2.336 1.514
Rice, brown, with vegetables and gravy, NS as to fat added in cooking 2.24 2.53 18.21 105 0 76.10 0 0 1.7 9 0.49 27 73 89 213 0.51 0.080 3.5 0 371 173 0.34 0 0 114 0.6 0.120 0.073 1.685 0.089 9 0.00 10.4 6.0 1 0.486 0.000 0.000 0.001 0.009 0.002 0.009 0.362 0.085 0.967 0.783 0.082 0.001 0.000 0.032 0.000 0.010 0.000 0.000 0.000 1.011 0.867
Spinach, creamed, baby food, strained 2.50 1.30 5.70 37 0 89.60 0 0 1.8 89 0.62 55 54 191 49 0.31 0.060 2.4 23 2456 0 0.83 0 0 4505 8.7 0.015 0.104 0.216 0.075 61 0.06 16.1 196.7 5 0.702 0.035 0.019 0.013 0.025 0.029 0.112 0.310 0.130 0.291 0.060 0.071 0.000 0.000 0.028 0.000 0.000 0.000 0.000 0.000 0.332 0.133
Tomato rice soup, prepared with water 0.82 1.06 8.54 47 0 88.52 0 0 0.7 11 0.31 2 13 129 319 0.20 0.055 0.9 1 124 0 0.86 0 8591 1 5.8 0.024 0.020 0.411 0.030 6 0.00 7.8 1.5 1 0.200 0.000 0.005 0.000 0.000 0.005 0.015 0.110 0.050 0.230 0.435 0.090 0.000 0.000 0.005 0.000 0.000 0.000 0.000 0.000 0.235 0.525
Tomato with corn and okra, cooked, made with butter 1.51 2.48 6.72 49 0 87.97 0 0 2.0 36 0.39 18 29 154 323 0.26 0.054 0.6 16 156 2 0.40 6 1047 336 10.1 0.284 0.064 0.818 0.108 26 0.00 10.4 12.3 5 1.349 0.079 0.049 0.029 0.063 0.063 0.182 0.604 0.260 0.614 0.255 0.015 0.000 0.000 0.025 0.000 0.004 0.000 0.000 0.000 0.643 0.275
Tomato, green, pickled 1.09 0.24 8.24 36 0 89.54 0 0 1.1 15 0.48 10 26 180 125 0.10 0.072 0.6 0 347 54 0.40 34 0 5 24.4 0.050 0.037 0.426 0.087 9 0.00 6.7 7.6 0 0.030 0.000 0.000 0.000 0.000 0.000 0.000 0.022 0.006 0.046 0.070 0.008 0.000 0.000 0.001 0.000 0.004 0.000 0.009 0.000 0.061 0.078
Turkey with gravy, dressing, potatoes, vegetable, frozen meal 6.97 3.89 16.32 128 0 71.16 0 0 1.3 27 0.79 18 106 230 326 0.48 0.056 7.9 34 124 0 0.24 0 0 327 0.8 0.138 0.252 3.057 0.177 24 0.33 17.1 11.5 14 0.877 0.000 0.000 0.000 0.007 0.010 0.050 0.572 0.237 1.180 0.965 0.102 0.020 0.000 0.055 0.000 0.000 0.000 0.000 0.000 1.235 1.087
Vegetable combination, excluding carrots, broccoli, and dark-green leafy; cooked, no sauce, made with margarine 2.52 2.06 11.99 71 0 82.26 0 0 2.7 20 0.71 19 53 163 275 0.42 0.067 0.6 22 409 23 0.70 31 0 907 12.7 0.097 0.066 0.845 0.219 32 0.00 17.2 17.8 0 0.407 0.000 0.000 0.000 0.002 0.006 0.005 0.234 0.144 0.618 0.823 0.099 0.000 0.000 0.002 0.000 0.006 0.000 0.000 0.000 0.626 0.922

Next, let’s look at a heatmap of the entire dataset. There are clearly some columns that are similar, but arranged alphabetically, it is very hard to tease apart what they are.

# create a version of the data to visualize in the heatmap
food_fndds |>
  dplyr::select(-description,-caffeine, -moisture, -alcohol, -theobromine) |>
  mutate_all(~(. - mean(.)) / sd(.)) |>
  # convert the dataset to a long-form
  pivot_longer(everything(), names_to = "nutrient", values_to = "value") |>
  # create a row ID variable
  group_by(nutrient) |>
  mutate(id = 1:n()) |>
  ungroup() |>
  # plot a heatmap
  ggplot() +
  geom_tile(aes(x = nutrient, y = id, fill = value)) +
  # choose the fill gradient
  scale_fill_gradientn("Scaled nutrient value",
                       colors = c("white", "grey80", "grey30", "black"),
                       values = c(0, 0.015, 0.02, 1)) +
  # do some formatting stuff
  scale_y_continuous(expand = c(0, 0)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "top") +
  labs(x = NULL, y = NULL)

Figure 1: A heatmap of all nutrient variables (scaled to a common scale)

Let’s create an explanatory version of this heatmap that arranges the nutrient columns so that similar nutrients are grouped together. We made these grouping decisions based on our own domain knowledge as well as an initial version of this heatmap. We will also arrange the food items/rows using a “clustering” algorithm that we will introduce in Chapter 8.

Figure 2 shows the result.

# create a version of the data to visualize in the heatmap
food_heatmap <- food_fndds |>
  dplyr::select(-description,-caffeine, -moisture, -alcohol, -theobromine) |>
  mutate_all(~(. - mean(.)) / sd(.))

# define a vector for manually arranging the nutrient columns into nutrient groups
nutrient_order <- c(# vitamins
  "vitamin_b6", "alpha_tocopherol",  "riboflavin", "thiamine", "folate", "niacin", "vitamin_c", "vitamin_b12", "retinol",  "beta_carotene", "lutein_zeaxanthin", "phylloquinone", "lycopene", "cryptoxanthin", "alpha_carotene",
  #major minerals
  "sodium", "potassium", "calcium", "phosphorus", "magnesium", "total_choline",
  # trace minerals
  "iron", "zinc", "selenium", "copper",
  # carbohydrates
  "carbohydrates", "total_dietary_fiber",
  # fats
  "fat", "saturated_fat", "monounsaturated_fat", "polyunsaturated_fat", "palmitic_acid", "stearic_acid", "oleic_acid", "linoleic_acid", "arachidonic_acid", "palmitoleic_acid", "alpha_linolenic_acid", "eicosenoic_gadoleic_acid", "cholesterol", "butyric_acid", "caproic_acid", "caprylic_acid", "capric_acid", "myristic_acid", "lauric_acid", "docosapentaenoic_acid", "docosahexaenoic_acid",  "parinaric_acid", "ecosapentenoic_acid", "erucic_acid", 
  # calories
  "calories",
  # protein
  "protein")

# do a fancy clustering to decide the order of the rows/food items
# you'll learn this in chapter 8
food_order <- hclust(dist(food_heatmap))$order


food_heatmap[food_order, nutrient_order] |>
  # convert the dataset to a long-form
  pivot_longer(everything(), names_to = "nutrient", values_to = "value") |>
  # ensure that the nutrient columns appear in the correct order by 
  # reordering the factor levels
  mutate(nutrient = fct_inorder(nutrient)) |>
  # create a row ID variable
  group_by(nutrient) |>
  mutate(id = 1:n()) |>
  ungroup() |>
  # plot a heatmap
  ggplot() +
  geom_tile(aes(x = nutrient, y = id, fill = value)) +
  # add some boundary lines
  geom_vline(xintercept = c(0.5, 15.5, 21.5, 25.5, 27.5, 51.5, 52.5, 53.5),
             color = "grey20", linewidth = 1.2) +
  geom_hline(yintercept = c(0, nrow(food_heatmap)),
             color = "grey20", linewidth = 1.2) +
  # choose the fill gradient
  scale_fill_gradientn("Scaled nutrient value",
                       colors = c("white", "grey80", "grey30", "black"),
                       values = c(0, 0.015, 0.02, 1)) +
  # do some formatting stuff
  scale_y_continuous(expand = c(0, 0)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "top") +
  labs(x = NULL, y = NULL)

Figure 2: A heatmap of all nutrient variables (scaled to a common scale)

1.1 PCS evaluation of heatmap exploration

Predictability

Let’s see if these same patterns exist in the Legacy food dataset. Below, we recreate the heatmap using the Legacy dataset. Note that we are coloring missing values blue.

# create a version of the data to visualize in the heatmap
food_legacy_heatmap <- food_legacy |>
  select(one_of(colnames(food_fndds))) |>
  dplyr::select(-description,-caffeine, -moisture, -alcohol, -theobromine) |>
  mutate_all(~(. - mean(., na.rm = TRUE)) / sd(., na.rm = TRUE))

# do a fancy clustering to decide the order of the rows/food items
# you'll learn this in chapter 8
# impute missing values with mean 
food_legacy_heatmap_imputed <- food_legacy_heatmap |> mutate(across(where(is.numeric), 
                                                                    ~if_else(is.na(.), mean(., na.rm = TRUE), .)))
food_legacy_order <- hclust(dist(food_legacy_heatmap_imputed))$order


food_legacy_heatmap[food_legacy_order, nutrient_order] |>
  # convert the dataset to a long-form
  pivot_longer(everything(), names_to = "nutrient", values_to = "value") |>
  # ensure that the nutrient columns appear in the correct order by 
  # reordering the factor levels
  mutate(nutrient = fct_inorder(nutrient)) |>
  # create a row ID variable
  group_by(nutrient) |>
  mutate(id = 1:n()) |>
  ungroup() |>
  # plot a heatmap
  ggplot() +
  geom_tile(aes(x = nutrient, y = id, fill = value)) +
  # add some boundary lines
  geom_vline(xintercept = c(0.5, 15.5, 21.5, 25.5, 27.5, 51.5, 52.5, 53.5),
             color = "grey20", linewidth = 1.2) +
  geom_hline(yintercept = c(0, nrow(food_legacy_heatmap)),
             color = "grey20", linewidth = 1.2) +
  # choose the fill gradient
  scale_fill_gradientn("Scaled nutrient value",
                       colors = c("white", "grey80", "grey30", "black"),
                       values = c(0, 0.015, 0.02, 1), na.value = "lightsteelblue1") +
  # do some formatting stuff
  scale_y_continuous(expand = c(0, 0)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "top") +
  labs(x = NULL, y = NULL)

Figure 3: A heatmap of all nutrient variables (scaled to a common scale). Missing values are shown as blue entries.

Since the food items between the FNDDS and Legacy datasets are different, we don’t expect to see the exact same patterns, but we do see that columns that had similar patterns in the FNDDS dataset also seem to have similar patterns in the Legacy dataset.

Stability to transformation judgment call

Let’s investigate the stability of this heatmap result to the judgment call of log-transforming the data (the original version was scaled and centered, but not log-transformed).

When we use the log-transformed scaled and centered data, the trends are fairly similar. Note, however, that for both plots we had to do a lot of finagling with the scale_fill_gradientn() values argument that controls the positions of the color transition.

# create a version of the data to visualize in the heatmap
food_heatmap_log <- food_fndds_log_scaled |>
  dplyr::select(-description,-caffeine, -moisture, -alcohol, -theobromine) |>
  mutate_all(~. - mean(.))


food_heatmap_log[food_order, nutrient_order] |>
  # convert the dataset to a long-form
  pivot_longer(everything(), names_to = "nutrient", values_to = "value") |>
  # ensure that the nutrient columns appear in the correct order by 
  # reordering the factor levels
  mutate(nutrient = fct_inorder(nutrient)) |>
  # create a row ID variable
  group_by(nutrient) |>
  mutate(id = 1:n()) |>
  ungroup() |>
  # plot a heatmap
  ggplot() +
  geom_tile(aes(x = nutrient, y = id, fill = value)) +
  # add some boundary lines
  geom_vline(xintercept = c(0.5, 15.5, 21.5, 25.5, 27.5, 51.5, 52.5, 53.5),
             color = "grey20", linewidth = 1.2) +
  geom_hline(yintercept = c(0, nrow(food_heatmap)),
             color = "grey20", linewidth = 1.2) +
  # choose the fill gradient
  scale_fill_gradientn("Scaled nutrient value",
                       colors = c("white", "grey80", "grey30", "black"),
                       values = c(0, 0.15, 0.17, 1)) +
  # do some formatting stuff
  scale_y_continuous(expand = c(0, 0)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "top") +
  labs(x = NULL, y = NULL)

Figure 4: A heatmap of all nutrient variables (mean-centered, scaled to a common scale and log-transformed)

2 Correlation exploration

Prompted by the heatmaps above, we next ask which nutrient variables are correlated with one another.

The heatmap below in Figure 5 (created using superheat() this time) displays the correlation matrix of the nutrients. The pretty.order.rows = TRUE and pretty.order.cols = TRUE arguments tell superheat() to use a clustering algorithm (see Chapter 8) under-the-hood to order the rows/columns so that similar rows/columns will appear next to one another.

It is even clearer from this figure that there are some groups of highly correlated nutrients.

superheat(cor(select(food_fndds, -description, -caffeine, 
                     -moisture, -alcohol, -theobromine)),
          heat.pal = c("#F6AE2D", "white", "#18678B"),
          heat.pal.values = c(0, 0.16, 1),
          bottom.label.text.angle = 90,
          bottom.label.size = 0.2,
          bottom.label.text.size = 4,
          bottom.label.text.alignment = "right",
          left.label.size = 0.2,
          left.label.text.size = 4,
          left.label.text.alignment = "right",
          pretty.order.rows = TRUE,
          pretty.order.cols = TRUE)

Figure 5: A heatmap of the correlation matrix

The question is, how similar are these groups to the nutrient groups that we identified in the previous section?

2.1 PCS evaluations of correlation matrix

Predictability

Let’s see if the same relationships hold in the Legacy food dataset. This time, however, rather than keeping rows with missing values, we remove them for now since correlations can’t be computed for data with missing values. Note that this may introduce bias in the data, but for now we won’t worry too much about it (but we will be aware of it).

The groupings seem to be very clear in this Legacy dataset too.

food_legacy |>
  select(one_of(colnames(food_fndds)), -description, -caffeine, 
         -moisture, -alcohol, -theobromine) |>
  drop_na() |>
  cor() |>
  superheat(heat.pal = c("#F6AE2D", "white", "#18678B"),
            heat.pal.values = c(0, 0.2, 1),
            bottom.label.text.angle = 90,
            bottom.label.size = 0.2,
            bottom.label.text.size = 4,
            bottom.label.text.alignment = "right",
            left.label.size = 0.2,
            left.label.text.size = 4,
            left.label.text.alignment = "right",
            pretty.order.rows = TRUE,
            pretty.order.cols = TRUE)

Figure 6: A heatmap of the correlation matrix

Stability to visualization judgment calls

As a quick stability evaluation, we will arrange the columns and rows of the correlation matrix in the same order as the heatmaps in Figure 1.

superheat(cor(select(food_fndds, -description, 
                     -caffeine, -moisture, 
                     -alcohol, -theobromine))[nutrient_order, nutrient_order],
          heat.pal = c("#F6AE2D", "white", "#18678B"),
          heat.pal.values = c(0, 0.16, 1),
          bottom.label.text.angle = 90,
          bottom.label.size = 0.2,
          bottom.label.text.size = 4,
          bottom.label.text.alignment = "right",
          left.label.size = 0.2,
          left.label.text.size = 4,
          left.label.text.alignment = "right")

Figure 7: A heatmap of the correlation matrix

There are still some very clear nutrient groupings in the heatmap when the nutrients are ordered based on our manual ordering defined above.

Stability to pre-processing judgment call

The correlations are generally larger after log-transforming the nutrient data. The groupings are still there but are a little bit more merged together, with the exception of the vitamins that still seem fairly distinct.

superheat(cor(select(food_fndds_log_scaled, -description,
                     -caffeine, -moisture, 
                     -alcohol, -theobromine)), 
          heat.pal = c("#F6AE2D", "white", "#18678B"),
          heat.pal.values = c(0, 0.28, 1),
          bottom.label.text.angle = 90,
          bottom.label.size = 0.2,
          bottom.label.text.size = 4,
          bottom.label.text.alignment = "right",
          left.label.size = 0.2,
          left.label.text.size = 4,
          left.label.text.alignment = "right",
          pretty.order.rows = TRUE,
          pretty.order.cols = TRUE)

Figure 8: A heatmap of the correlation matrix after standardization and log-scaling the nutrients

3 Defining nutrient groups

Based on these plots and our own nutrient domain knowledge, we will define the following groups of nutrients:

  1. Vitamins: vitamin C, vitamin B6, vitamin B12, riboflavin, thiamine, folate, niacin, beta carotene, alpha carotene, lutein zeaxanthin, phylloquinone, alpha tocopherol, retinol, lycopene, cryptoxanthin

  2. Fats: fat, saturated fat, monounsaturated fat, polyunsaturated fat, cholesterol, and all of the fatty acids

  3. Major minerals: sodium, potassium, calcium, phosphorus, magnesium, total choline

  4. Trace minerals: iron, zinc, selenium, copper

  5. Carbohydrates: carbohydrates, total dietary fiber

  6. Calories: calories

  7. Protein: protein

We chose to ignore the variables “moisture”, “alcohol”, “caffeine”, and “theobromine”, since we deemed these not particularly important for making nutritional dietary choices.

Note that these groupings are very clearly a judgment call, and we could certainly have defined them differently!

4 Exploring the relationship between the food items

The above explorations gave us a vague sense of the relationships between the nutrients (we will try to formalize this in the PCA analysis file), but we don’t yet have a sense of the food items themselves.

One way to explore the food items might be to take two nutrient variables (such as sodium and potassium) and visualize the food items in the space defined by these food items.

The points in the sodium-potassium scatterplot in Figure 9 clearly exhibit a large amount of overlap (despite adding transparency with the alpha argument), but adding the text of a few food items helps tease apart some relationships in the food items. For instance, it seems that many cheese-related items (high in sodium, moderate in potassium) and fish-related food items (high in both sodium and potassium) are grouped close together.

We are visualizing the log-transformed standardized data because the points are more spread out in this transformed space and so the trends are slightly easier to pick out.

food_fndds_log_scaled |>
  ggplot() +
  geom_vline(xintercept = 0, col = "grey70") +
  geom_hline(yintercept = 0, col = "grey70") +
  geom_point(aes(x = sodium, y = potassium),
             alpha = 0.2, col = "grey50") +
  geom_text(aes(x = sodium, y = potassium, 
                label = str_trunc(description, 15)), 
            check_overlap = TRUE, hjust = 0) 

Figure 9: A scatterplot comparing sodium and potassium and overlaying some food items descriptions on top of the plot.

Let’s look at a few other pairs of variables too. To make it easy for ourselves, we can define the following reusable function. Note that the {{.}} notation in the function allows us to pass unquoted variable names as arguments in the tidyverse style (this is called “tidy evaluation”).

plotNutrientScatter <- function(var1, var2) {
  food_fndds_log_scaled |>
    ggplot() +
    geom_vline(xintercept = 0, col = "grey70") +
    geom_hline(yintercept = 0, col = "grey70") +
    geom_point(aes(x = {{ var1 }}, y = {{ var2 }}),
               alpha = 0.2, col = "grey50") +
    geom_text(aes(x = {{ var1 }}, y = {{ var2 }}, 
                  label = str_trunc(description, 15)), 
              check_overlap = TRUE, hjust = 0)
}

When we compare fat and vitamin C values, we see groups of fruits/vegetables (high in vitamin C, low in fat), groups of cereals (high in vitamin C and medium in fat), and groups of milk products (low in vitamin C and low in fat - probably these are mostly “fat free” milk products) standing out.

plotNutrientScatter(fat, vitamin_c)

When we compare alpha carotene and protein, we see meats stand out as high in both, and milk products stand out as notably low in alpha carotene (but varied in protein).

plotNutrientScatter(alpha_carotene, protein)

5 Comparing the FNDDS and Legacy data

If we’re going to use the SR legacy food dataset for conducting predictability assessments, we should explicitly compare this dataset to the FNDDS dataset.

A simple check is to compare the distributions of various nutrients. To compare the distributions of protein, fat, and sodium, we use side-by-side boxplots. Overall, the distributions look fairly similar, but are clearly not identical.

These differences could be a difference in the way the nutrients are measured, or it could be a difference in the types of food items that are included in each of the datasets.

# generate a vector of common nutrients
common_vars <- colnames(food_fndds)[colnames(food_fndds) %in% colnames(food_legacy)]

food_common <- rbind(data.frame(dataset = "fndds", food_fndds[, common_vars]),
                     data.frame(dataset = "legacy", food_legacy[, common_vars]))

gg_box_iron <- food_common |>
  ggplot() +
  geom_boxplot(aes(x = dataset, y = iron)) +
  ggtitle("Iron")

gg_box_protein <- food_common |>
  ggplot() +
  geom_boxplot(aes(x = dataset, y = protein)) +
  ggtitle("Protein")

gg_box_fat <- food_common |>
  ggplot() +
  geom_boxplot(aes(x = dataset, y = fat)) +
  ggtitle("Fat")

gg_box_sodium <- food_common |>
  ggplot() +
  geom_boxplot(aes(x = dataset, y = sodium)) +
  ggtitle("Sodium")


(gg_box_protein + gg_box_fat) / (gg_box_iron + gg_box_sodium)
Warning: Removed 80 rows containing non-finite values (`stat_boxplot()`).
Warning: Removed 84 rows containing non-finite values (`stat_boxplot()`).

Figure 10: Boxplots comparing the distribution of various nutrients in the FNDDS and Legacy datasets

Let’s also compare the nutrient values of a few specific food items. For instance, here are the nutrient values of various nonfat Greek yogurt.

food_common |> 
  filter(str_detect(tolower(description), "yogurt"),
         str_detect(tolower(description), "greek"),
         str_detect(tolower(description), "fat"),
         str_detect(tolower(description), "plain"),
         str_detect(tolower(description), "low")) |> 
  transmute(dataset, description = str_trunc(description, 30), 
            protein, fat, carbohydrates, calories, calcium, sodium) |>
  arrange(dataset)
dataset description protein fat carbohydrates calories calcium sodium
fndds Yogurt, Greek, low fat milk… 9.95 1.92 3.94 73 115 34
legacy Yogurt, Greek, plain, lowfat 9.95 1.92 3.94 73 115 34
food_common |> 
  filter(str_detect(tolower(description), "soup"),
         str_detect(tolower(description), "minestrone"),
         str_detect(tolower(description), "can")) |> 
  transmute(dataset, description = str_trunc(description, 30), 
            protein, fat, carbohydrates, calories, calcium, sodium) |>
  arrange(dataset)
dataset description protein fat carbohydrates calories calcium sodium
fndds Minestrone soup, reduced so… 2.00 0.80 9.00 50 20 215
fndds Minestrone soup, canned, pr… 1.74 1.03 4.60 34 16 261
legacy Soup, minestrone, canned, c… 3.48 2.05 9.17 68 28 516
legacy Soup, minestrone, canned, c… 2.13 1.17 8.64 53 25 288
legacy Soup, minestrone, canned, p… 1.77 1.04 4.66 34 14 254
legacy Soup, minestrone, canned, r… 2.00 0.80 9.00 50 20 215
food_common |> 
  filter(str_detect(tolower(description), "pecan"),
         str_detect(tolower(description), "pie")) |> 
  transmute(dataset, description = str_trunc(description, 30), 
            protein, fat, carbohydrates, calories, calcium, sodium) |>
  arrange(dataset)
dataset description protein fat carbohydrates calories calcium sodium
fndds Pie, pecan 4.5 16.69 59.61 407 22 275
fndds Pie, pecan, individual size… 4.5 16.69 59.61 407 22 275
legacy Pie, pecan, prepared from r… 4.9 22.20 52.20 412 32 262
legacy Pie, pecan, commercially pr… 4.5 16.69 59.61 407 22 275

Overall, since we would expect some differences in the food items that people might use our app on and the underlying data in any of these datasets, these differences across the different datasets feel in line with the differences we would expect from the future data that our users would be inputting into our app.

We’re going to stop here, but feel free to conduct any additional explorations that occur to you!